IDa-Det: An Information Discrepancy-Aware Distillation for 1-bit Detectors


FIGURE 6.14
The Mahalanobis distance of the gradients in the intermediate neck features between Res101-Res18 (gathered on the left) and Res101-BiRes18 (uniformly dispersed) on various datasets: (a) VOC trainval0712, (b) VOC test2007, (c) COCO trainval35k, (d) COCO minival.

proposal saliency maps of Res101 and Res18 (blue) is much smaller than that of Res101 and BiRes18 (orange). That is to say, the smaller the distance, the smaller the discrepancy. In brief, conventional KD methods are effective in distilling real-valued detectors, but seem much less effective in distilling 1-bit detectors.
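To make the discrepancy measure concrete, the following minimal PyTorch sketch estimates channel-wise Gaussian statistics of a proposal feature and computes a Mahalanobis-style distance between teacher and student statistics. It is only an illustration under a diagonal-covariance assumption; the function names and the pooled-variance normalization are ours, not the original implementation.

import torch

def channelwise_gaussian(feat: torch.Tensor):
    # Model a proposal feature map of shape (C, H, W) as a channel-wise
    # Gaussian: one mean and one variance per channel, estimated over
    # the spatial positions.
    flat = feat.reshape(feat.size(0), -1)
    return flat.mean(dim=1), flat.var(dim=1, unbiased=False)

def mahalanobis_discrepancy(feat_t: torch.Tensor, feat_s: torch.Tensor,
                            eps: float = 1e-6) -> torch.Tensor:
    # With a diagonal covariance, the Mahalanobis distance reduces to a
    # variance-normalized Euclidean distance between the channel means.
    mu_t, var_t = channelwise_gaussian(feat_t)
    mu_s, var_s = channelwise_gaussian(feat_s)
    d2 = ((mu_t - mu_s) ** 2 / (var_t + var_s + eps)).sum()
    return d2.sqrt()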

We are motivated by the above observation and present an information discrepancy-aware distillation method for 1-bit detectors (IDa-Det) [260], which effectively addresses the information discrepancy problem and leads to an efficient distillation process. As shown in Fig. 6.15, we introduce a discrepancy-aware method to select proposal pairs for distilling 1-bit detectors, rather than relying only on the object anchor locations of the student model or on the ground truth, as in existing methods [235, 264, 79]. We further introduce a novel entropy distillation loss that leverages more comprehensive information than conventional loss functions. Together, these components yield a powerful information discrepancy-aware distillation method for 1-bit detectors.
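As an illustration of the selection step, the sketch below ranks candidate teacher-student proposal pairs by the discrepancy measure defined above (reusing mahalanobis_discrepancy) and keeps the pairs with the largest discrepancy for distillation. The pair budget k and the exhaustive pairing are hypothetical choices made for clarity, not the exact procedure of [260].

def select_discrepant_pairs(teacher_feats, student_feats, k: int = 64):
    # Score every (teacher, student) proposal feature pair by information
    # discrepancy and keep the top-k pairs; k is a hypothetical budget.
    scores = []
    for i, f_t in enumerate(teacher_feats):
        for j, f_s in enumerate(student_feats):
            scores.append((mahalanobis_discrepancy(f_t, f_s).item(), i, j))
    scores.sort(key=lambda s: s[0], reverse=True)  # largest discrepancy first
    return [(i, j) for _, i, j in scores[:k]]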

FIGURE 6.15
Overview of the proposed information discrepancy-aware distillation (IDa-Det) framework. A real-valued teacher and a 1-bit student produce proposals (object regions, false positives, and missed detections), whose features are modeled as channel-wise Gaussian distributions φ(·). We first select representative proposal pairs based on the information discrepancy. Then we propose the entropy distillation loss to eliminate the information discrepancy.
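To sketch how an entropy-style loss can act on the channel-wise Gaussian proposal model of Fig. 6.15, the snippet below (reusing channelwise_gaussian from above) matches both the channel means and the Gaussian entropies of a selected proposal pair; since the differential entropy of a Gaussian is 0.5*log(2*pi*e*var), matching entropies reduces to matching log-variances. This is one plausible form written for illustration, not the exact loss proposed in [260].

def entropy_distillation_loss(feat_t: torch.Tensor, feat_s: torch.Tensor,
                              eps: float = 1e-6) -> torch.Tensor:
    # Match channel means and channel-wise Gaussian entropies of a
    # teacher-student proposal pair. H(N(mu, var)) = 0.5*log(2*pi*e*var),
    # so matching entropies amounts to matching log-variances.
    mu_t, var_t = channelwise_gaussian(feat_t)
    mu_s, var_s = channelwise_gaussian(feat_s)
    mean_term = (mu_t - mu_s).pow(2).mean()
    entropy_term = (torch.log(var_t + eps) - torch.log(var_s + eps)).pow(2).mean()
    return mean_term + entropy_term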